KCAT: A Korean Corpus Annotating Tool Minimizing Human Intervention
Abstract
While large POS (part-of-speech) annotated corpora play an important role in natural language processing, the annotated corpus requires very high accuracy and consistency. To build such an accurate and consistent corpus, we often use a manual tagging method. But the manual tagging is very labor intensive and expensive. Furthermore, it is not easy to get consistent results from the human experts. In this paper, we present an efficient tool for building large accurate and consistent corpora with minimal human labor. The proposed tool supports semiautomatic tagging. Using disambiguation rules acquired from human experts, it minimizes the human intervention in both the manual tagging and post-editing steps.
Similar resources
NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus
The NTU-MC Toolkit is a compilation of tools to annotate the Nanyang Technological University Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpora of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Other than...
TESLA: A Tool for Annotating Geospatial Language Corpora
In this paper, we present The gEoSpatial Language Annotator (TESLA)—a tool which supports human annotation of geospatial language corpora. TESLA interfaces with a GIS database for annotating grounded geospatial entities and uses Google Earth for visualization of both entity search results and evolving object and speaker position from GPS tracks. We also discuss a current annotation effort using...
Annotating Korean Demonstratives
This paper presents preliminary work on a corpus-based study of Korean demonstratives. Through the development of an annotation scheme and the use of spoken and written corpora, we aim to determine different functions of demonstratives and to examine their distributional properties. Our corpus study adopts similar features of annotation used in Botley and McEnery (2001) and provides some lingui...
Arabic anaphora resolution: corpora annotation with coreferential links
Annotated resources are much needed for evaluation and training of anaphora resolution systems. The coreferential chain annotation is a difficult task which can not be realised without an appropriate tool. In this paper, we present our work on Arabic corpora annotation with anaphoric links (i.e., the annotation of the identity relation between the anaphors and their antecedents). In particular,...
Corpus building for Mongolian language
This paper presents an ongoing research aimed to build the first corpus, 5 million words, for Mongolian language by focusing on annotating and tagging corpus texts according to TEI XML (McQueen, 2004) format. Also, a tool, MCBuilder, which provides support for flexibly and manually annotating and manipulating the corpus texts with XML structure, is presented.
Publication date: 2000